AITopics | quality label

quality label

StableMotion: Training Motion Cleanup Models with Unpaired Corrupted Data

Mu, Yuxuan, Ling, Hung Yu, Shi, Yi, Ojeda, Ismael Baira, Xi, Pengcheng, Shu, Chang, Zinno, Fabio, Peng, Xue Bin

arXiv.org Artificial Intelligence, Sep-16-2025

Motion capture (mocap) data often exhibits visually jarring artifacts due to inaccurate sensors and post-processing. Cleaning this corrupted data can require substantial manual effort from human experts, which can be a costly and time-consuming process. Previous data-driven motion cleanup methods offer the promise of automating this cleanup process, but often require in-domain paired corrupted-to-clean training data. Constructing such paired datasets requires access to high-quality, relatively artifact-free motion clips, which often necessitates laborious manual cleanup. In this work, we present StableMotion, a simple yet effective method for training motion cleanup models directly from unpaired corrupted datasets that need cleanup. The core component of our method is the introduction of motion quality indicators, which can be easily annotated - through manual labeling or heuristic algorithms - and enable training of quality-aware motion generation models on raw motion data with mixed quality. At test time, the model can be prompted to generate high-quality motions using the quality indicators. Our method can be implemented through a simple diffusion-based framework, leading to a unified motion generate-discriminate model, which can be used to both identify and fix corrupted frames. We demonstrate that our proposed method is effective for training motion cleanup models on raw mocap data in production scenarios by applying StableMotion to SoccerMocap, a 245-hour soccer mocap dataset containing real-world motion artifacts. The trained model effectively corrects a wide range of motion artifacts, reducing motion pops and frozen frames by 68% and 81%, respectively. Results and code are available at https://yxmu.foo/stablemotion-page
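The abstract notes that quality indicators can be annotated with heuristic algorithms. As a rough illustration only, the sketch below labels frames of a toy clip by flagging frozen frames and pops. The function name `label_frames`, both thresholds, and the per-coordinate distance are assumptions made for this example and are not taken from the StableMotion paper or code.

```python
from typing import List, Sequence


def label_frames(poses: Sequence[Sequence[float]],
                 freeze_eps: float = 1e-6,
                 pop_threshold: float = 1.0) -> List[int]:
    """Return one quality indicator per frame: 1 = looks clean, 0 = suspected artifact.

    A frame is flagged when it is (nearly) identical to the previous frame
    (a frozen frame) or when it jumps by more than pop_threshold from the
    previous frame (a pop). Both thresholds are hypothetical values.
    """
    labels = [1] * len(poses)
    for t in range(1, len(poses)):
        # Largest per-coordinate change between consecutive frames.
        step = max(abs(a - b) for a, b in zip(poses[t], poses[t - 1]))
        if step <= freeze_eps or step > pop_threshold:
            labels[t] = 0
    return labels


# Toy one-joint clip: smooth motion, a frozen frame, then a pop.
clip = [[0.0], [0.1], [0.2], [0.2], [0.3], [5.0], [0.5]]
labels = label_frames(clip)
```

With this simple rule the frame that returns from a pop is flagged as well, since it also differs sharply from its predecessor. Per-frame labels of this kind are the sort of signal a quality-aware model could be conditioned on.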

artificial intelligence, machine learning, natural language, (17 more...)

arXiv: 2505.03154

Country:

North America (0.29)
Europe (0.28)
Asia > China (0.16)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Sports (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.49)


MeshFleet: Filtered and Annotated 3D Vehicle Dataset for Domain Specific Generative Modeling

Boborzi, Damian, Mueller, Phillip, Emrich, Jonas, Schmid, Dominik, Mueller, Sebastian, Mikelsons, Lars

arXiv.org Artificial Intelligence, Mar-18-2025

Generative models have recently made remarkable progress in the field of 3D objects. However, their practical application in fields like engineering remains limited since they fail to deliver the accuracy, quality, and controllability needed for domain-specific tasks. Fine-tuning large generative models is a promising perspective for making these models available in these fields. Creating high-quality, domain-specific 3D datasets is crucial for fine-tuning large generative models, yet the data filtering and annotation process remains a significant bottleneck. We present MeshFleet, a filtered and annotated 3D vehicle dataset extracted from Objaverse-XL, the most extensive publicly available collection of 3D objects. Our approach proposes a pipeline for automated data filtering based on a quality classifier. This classifier is trained on a manually labeled subset of Objaverse, incorporating DINOv2 and SigLIP embeddings, refined through caption-based analysis and uncertainty estimation. We demonstrate the efficacy of our filtering method through a comparative analysis against caption and image aesthetic score-based techniques and fine-tuning experiments with SV3D, highlighting the importance of targeted data selection for domain-specific 3D generative modeling.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv: 2503.14002

Country:

North America > United States > New York (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report (0.82)

Industry:

Automobiles & Trucks (0.93)
Transportation > Ground > Road (0.68)
Transportation > Passenger (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)


Finding a Wolf in Sheep's Clothing: Combating Adversarial Text-To-Image Prompts with Text Summarization

Cooper, Portia, Narnoli, Harshita, Surdeanu, Mihai

arXiv.org Artificial Intelligence, Dec-15-2024

Text-to-image models are vulnerable to the stepwise "Divide-and-Conquer Attack" (DACA), which uses a large language model to obfuscate inappropriate content in prompts by wrapping sensitive text in a benign narrative. To mitigate stepwise DACA attacks, we propose a two-layer method involving text summarization followed by binary classification. We assembled the Adversarial Text-to-Image Prompt (ATTIP) dataset ($N=940$), which contained DACA-obfuscated and non-obfuscated prompts. From the ATTIP dataset, we created two summarized versions: one generated by a small encoder model and the other by a large language model. Then, we used an encoder classifier and a GPT-4o classifier to perform content moderation on the summarized and unsummarized prompts. When compared with a classifier that operated over the unsummarized data, our method improved F1 score performance by 31%. Further, the highest recorded F1 score achieved (98%) was produced by the encoder classifier on a summarized ATTIP variant. This study indicates that pre-classification text summarization can inoculate content detection models against stepwise DACA obfuscations.
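The two-layer idea (summarize first, then classify) and the F1 metric used to score it can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `moderate`, the toy summarizer, the keyword classifier, and the example prompts are all hypothetical stand-ins for the encoder and LLM components the abstract describes.

```python
from typing import Callable, List


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def moderate(prompts: List[str],
             summarize: Callable[[str], str],
             classify: Callable[[str], int]) -> List[int]:
    """Layer 1 condenses each prompt; layer 2 classifies the summary."""
    return [classify(summarize(p)) for p in prompts]


# Hypothetical stand-ins: keep the first sentence, then match a keyword.
toy_summarize = lambda text: text.split(".")[0]
toy_classify = lambda text: int("forbidden" in text)

prompts = ["a calm landscape. painted at dawn", "forbidden scene. told as a fable"]
predictions = moderate(prompts, toy_summarize, toy_classify)
score = f1_score([0, 1], predictions)
```

In practice both callables would wrap real models; passing them as parameters makes it easy to compare summarized and unsummarized variants by swapping in an identity function for `summarize`.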

classifier, large language model, machine learning, (20 more...)

arXiv: 2412.12212

Country:

North America > United States > Arizona > Pima County > Tucson (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > Monaco (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry:

Media (0.46)
Leisure & Entertainment (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.75)


Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Liu, Ziche, Ke, Rui, Jiang, Feng, Li, Haizhou

arXiv.org Artificial Intelligence, Jun-20-2024

Data selection for fine-tuning Large Language Models (LLMs) aims to select a high-quality subset from a given candidate dataset to train a Pending Fine-tune Model (PFM) into a Selective-Enhanced Model (SEM). It can improve the model performance and accelerate the training process. Although a few surveys have investigated related works of data selection, there is a lack of comprehensive comparison between existing methods due to their various experimental settings. To address this issue, we first propose a three-stage scheme for data selection and comprehensively review existing works according to this scheme. Then, we design a unified comparing method with ratio-based efficiency indicators and ranking-based feasibility indicators to overcome the difficulty of comparing various models with diverse experimental settings. After an in-depth comparative analysis, we find that the more targeted method with data-specific and model-specific quality labels has higher efficiency, but the introduction of additional noise information should be avoided when designing selection algorithms. Finally, we summarize the trends in data selection and highlight the short-term and long-term challenges to guide future research.

data selection, dataset, indicator, (15 more...)

arXiv: 2406.14115

Country:

North America > United States > California > San Francisco County > San Francisco (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.74)


TasTe: Teaching Large Language Models to Translate through Self-Reflection

Wang, Yutong, Zeng, Jiali, Liu, Xuebo, Meng, Fandong, Zhou, Jie, Zhang, Min

arXiv.org Artificial Intelligence, Jun-12-2024

Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks. Techniques like instruction tuning have effectively enhanced the proficiency of LLMs in the downstream task of machine translation. However, the existing approaches fail to yield satisfactory translation outputs that match the quality of supervised neural machine translation (NMT) systems. One plausible explanation for this discrepancy is that the straightforward prompts employed in these methodologies are unable to fully exploit the acquired instruction-following capabilities. To this end, we propose the TasTe framework, which stands for translating through self-reflection. The self-reflection process includes two stages of inference. In the first stage, LLMs are instructed to generate preliminary translations and conduct self-assessments on these translations simultaneously. In the second stage, LLMs are tasked to refine these preliminary translations according to the evaluation results. The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods. Our work presents a promising approach to unleash the potential of LLMs and enhance their capabilities in MT. The codes and datasets are open-sourced at https://github.com/YutongWang1216/ReflectionLLMMT.
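The two-stage inference described in the abstract can be pictured as two chained model calls. The sketch below shows only that control flow. The function name `translate_with_reflection`, the prompt wording, and the `llm` callable are assumptions made for illustration and do not reproduce the prompts or code from the TasTe repository.

```python
from typing import Callable


def translate_with_reflection(source: str, llm: Callable[[str], str]) -> str:
    """Two-stage inference: draft plus self-assessment, then refinement."""
    # Stage 1: ask for a preliminary translation and a self-assessment together.
    stage1_prompt = (
        "Translate the text below and assess the quality of your translation.\n"
        f"Text: {source}"
    )
    draft_and_assessment = llm(stage1_prompt)

    # Stage 2: ask the model to refine the draft, guided by its own assessment.
    stage2_prompt = (
        "Refine the preliminary translation according to the assessment.\n"
        f"Text: {source}\n"
        f"Preliminary translation and assessment: {draft_and_assessment}"
    )
    return llm(stage2_prompt)


# Stand-in model that records prompts so the control flow can be inspected.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return "draft (assessment: needs work)" if len(calls) == 1 else "refined"


result = translate_with_reflection("Guten Morgen", fake_llm)
```

Here `fake_llm` only demonstrates that the stage 1 output is fed into the stage 2 prompt; a real setup would replace it with a call to an instruction-tuned model.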

large language model, machine learning, translation, (18 more...)

arXiv: 2406.08434

Country: Asia > China (0.68)

Genre: Research Report > Promising Solution (0.34)

Industry:

Information Technology (0.68)
Energy > Oil & Gas (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)


WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia

Ando, Kenichiro, Sekine, Satoshi, Komachi, Mamoru

arXiv.org Artificial Intelligence, Dec-29-2023

Wikipedia can be edited by anyone and thus contains sentences of varying quality. As a result, Wikipedia includes some poor-quality edits, which are often marked up by other editors. While editors' reviews enhance the credibility of Wikipedia, it is hard to check all edited text. Assisting in this process is very important, but a large and comprehensive dataset for studying it does not currently exist. Here, we propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia. Each sentence is extracted from the entire revision history of English Wikipedia, and the target quality labels were carefully investigated and selected. WikiSQE has about 3.4 M sentences with 153 quality labels. In the experiment with automatic classification using competitive machine learning models, sentences that had problems with citation, syntax/semantics, or propositions were found to be more difficult to detect. In addition, by performing human annotation, we found that the model we developed performed better than the crowdsourced workers. WikiSQE is expected to be a valuable resource for other tasks in NLP.

category, quality label, wikipedia, (15 more...)

arXiv: 2305.05928

Country:

North America > United States > Mississippi > Attala County (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > Canada (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
Information Technology > Communications > Social Media > Crowdsourcing (0.34)


Towards Sustainable Development: A Novel Integrated Machine Learning Model for Holistic Environmental Health Monitoring

Mazumder, Anirudh, Engala, Sarthak, Nallaparaju, Aditya

arXiv.org Artificial Intelligence, Aug-20-2023

Urbanization enables economic growth but also harms the environment through degradation. Traditional methods of detecting environmental issues have proven inefficient. Machine learning has emerged as a promising tool for tracking environmental deterioration by identifying key predictive features. Recent research focused on developing a predictive model using pollutant levels and particulate matter as indicators of environmental state in order to outline challenges. Machine learning was employed to identify patterns linking areas with worse conditions. This research aims to assist governments in identifying intervention points, improving planning and conservation efforts, and ultimately contributing to sustainable development.

artificial intelligence, dataset, machine learning, (15 more...)

arXiv: 2308.10317

Country: North America > United States > Texas > Denton County > Denton (0.05)

Genre: Research Report (1.00)

Industry:

Health & Medicine (1.00)
Water & Waste Management (1.00)
Law > Environmental Law (0.89)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)


Identifying High-Quality Chinese News Comments Based on Multi-Target Text Matching Model

Chen, Deli, Ma, Shuming, Yang, Pengcheng, Sun, Xu

arXiv.org Artificial Intelligence, Aug-21-2018

With the development of information technology, there has been explosive growth in the number of online comments on news, blogs, and so on. The sheer volume of comments is overwhelming, and they often contain misleading and unwelcome information. Therefore, it is necessary to identify high-quality comments and filter out low-quality comments. In this work, we introduce a novel task: high-quality comment identification (HQCI), which aims to automatically assess the quality of online comments. First, we construct a news comment corpus, which consists of news, comments, and the corresponding quality label. Second, we analyze the dataset, and find that the quality of comments can be measured in three aspects: informativeness, consistency, and novelty. Finally, we propose a novel multi-target text matching model, which can measure these three aspects by referring to the news and surrounding comments. Experimental results show that our method can outperform various baselines by a large margin on the news dataset.

information, machine learning, natural language, (19 more...)

arXiv: 1808.07191

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
Asia > China > Beijing > Beijing (0.04)
North America > United States > Nevada (0.04)
(10 more...)

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
